07. K-means, Overview
K-Means Clustering
To perform population segmentation, one of our strategies will be to use k-means clustering to group data into similar clusters. To review, the k-means clustering algorithm can be broken down into a few steps; the following steps assume that you have n-dimensional data, which is to say, data with a discrete number of features associated with it. In the case of housing price data, these features include traits like house size and location; features are just measurable components of a data point. K-means works as follows (a short code sketch appears after these steps):
1. You select k, a predetermined number of clusters that you want to form. Then k points (the centroids for the k clusters) are placed at random locations in feature space.
2. For each point in your training dataset, you find the centroid that the point is closest to and assign the point to that centroid's cluster.
3. For each cluster centroid, you move that centroid so that it sits at the center (the mean) of all the points that were assigned to its cluster in step 2.
4. Repeat steps 2 and 3 until you've either reached convergence, where points no longer change cluster membership, or until some specified number of iterations has been reached.
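To make these steps concrete, here is a minimal from-scratch sketch in NumPy. The function name `kmeans` and its parameters are our own illustration, not a particular library's API:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X is an (n_points, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: place k centroids by picking k random points from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (keeping a centroid in place if no points were assigned to it).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop early once assignments (and centroids) stop changing.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: cluster 200 random 2-D points into 3 groups.
X = np.random.rand(200, 2)
labels, centroids = kmeans(X, k=3)
```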
This algorithm can be applied to any kind of unlabeled data. You can watch a video explanation of the k-means algorithm, as applied to color image segmentation, below. In this case, the k-means algorithm treats the R, G, and B values of each pixel as features and uses those features to cluster the individual pixels in an image!
Video: Color Image Segmentation with K-means Clustering
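As a rough sketch of that idea, pixel-level color segmentation could be done with scikit-learn's KMeans as below. The file names "photo.jpg" and "segmented.jpg" are placeholders, and the choice of k = 4 clusters is arbitrary:

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Load an image and flatten it to one row per pixel, with R, G, B as features.
img = np.asarray(Image.open("photo.jpg").convert("RGB"))
pixels = img.reshape(-1, 3).astype(float)

# Cluster the pixels into k color groups, then recolor every pixel with its
# cluster's centroid color to produce the segmented image.
k = 4
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
segmented = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)
Image.fromarray(segmented).save("segmented.jpg")
```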
Data Dimensionality
One thing to note is that it's often easiest to form clusters when you have low-dimensional data. For example, it can be difficult, and the results are often noisy, when you try to get good clusters from data that has over 100 features. In high-dimensional cases, a dimensionality reduction step often takes place before the data is passed to a clustering algorithm. We'll discuss PCA as a dimensionality reduction technique in the practical code example later.
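As a brief sketch of that pipeline (the data here is random and purely illustrative), reducing 100 features to a few principal components before clustering might look like this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical high-dimensional data: 1,000 points with 100 features each.
X = np.random.rand(1000, 100)

# Standardize the features, then project onto a handful of principal
# components before handing the data to the clustering algorithm.
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=5).fit_transform(X_scaled)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
```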